Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations
Authors
Abstract
While today's orthography is strict and seldom changes, this has not always been the case. In historical texts, the spelling of a word often differs from today's and, in some periods, even varies from one occurrence to the next within a single text. Information retrieval on historical corpora can handle these variations with fuzzy matching based on Levenshtein distance with stochastic weights, in particular by using the noisy channel model of (3) and the simple algorithm given there. That algorithm, originally used for spell checking, is adapted here to retrieving historical word forms from queries in modern spelling; its stochastic weights are learned from training pairs of modern and historical spellings. Using these weights improves the F-score over standard Levenshtein distance. Preparing the training pairs usually requires manual work. To avoid this work, we devised an unsupervised algorithm for obtaining the training pairs.
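The idea of learning edit weights from spelling pairs can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the training pairs are already given (the paper's contribution is obtaining them without supervision), it takes edit operations from a single unit-cost alignment per pair, and it uses a normalized negative log relative frequency as the weight. The function names (`align`, `learn_weights`, `make_cost`, `weighted_distance`) and the example word pairs are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def align(source, target):
    """Return one minimum unit-cost alignment as (src_char, tgt_char) pairs.

    An empty string marks an insertion (src side) or deletion (tgt side).
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the bottom-right cell to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            ops.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((source[i - 1], ""))
            i -= 1
        else:
            ops.append(("", target[j - 1]))
            j -= 1
    return ops[::-1]


def learn_weights(pairs):
    """Map each observed non-identity edit to a cost between 0 and 1.

    Assumption: cost = -log(relative frequency), scaled so that every
    observed edit is cheaper than an unseen edit (which costs 1.0).
    """
    counts = Counter(op for modern, historical in pairs
                     for op in align(modern, historical) if op[0] != op[1])
    total = sum(counts.values())
    if total == 0:
        return {}
    ceiling = math.log(total + 1)
    return {op: -math.log(c / total) / ceiling for op, c in counts.items()}


def make_cost(weights, default=1.0):
    """Build a cost function: identities are free, unseen edits cost `default`."""
    def cost(a, b):
        return 0.0 if a == b else weights.get((a, b), default)
    return cost


def weighted_distance(source, target, cost):
    """Levenshtein distance where each edit is priced by `cost(src, tgt)`."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(source[i - 1], "")
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost("", target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + cost(source[i - 1], target[j - 1]),
                          d[i - 1][j] + cost(source[i - 1], ""),
                          d[i][j - 1] + cost("", target[j - 1]))
    return d[n][m]


# Hypothetical (modern, historical) training pairs, for illustration only.
pairs = [("teil", "theil"), ("tier", "thier"), ("rat", "rath"), ("sein", "seyn")]
weights = learn_weights(pairs)
cost = make_cost(weights)
```

With these toy pairs, inserting an "h" becomes cheaper than an edit never seen in training, so `weighted_distance("mut", "muth", cost)` is smaller than `weighted_distance("mut", "mus", cost)`, whereas standard Levenshtein distance rates both at 1. A retrieval system could then accept candidates whose weighted distance to the modern query falls below a threshold.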
Similar resources
Learning a Spelling Error Model from Search Query Logs
Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search qu...
Unsupervised learning of edit parameters for matching name variants
Since named entities are often written in different ways, question answering (QA) and other language processing tasks stand to benefit from entity matching. We address the problem of finding equivalent person names in unstructured text. Our approach is a generalization of spelling correction: We compare to candidate matches by applying a set of edits to an input name. We introduce a novel unsup...
Fuzzy lexical matching
Being able to automatically correct spelling errors is useful in cases where the set of documents is too vast to involve human interaction. In this bachelor's thesis, we investigate an implementation that attempts to perform such corrections using a lexicon and edit distance measure. We compare the familiar Levenshtein and Damerau-Levenshtein distances to modifications where each edit operation ...
Bootstrapping Morphological Analysis of Gĩkũyũ Using Unsupervised Maximum Entropy Learning
This paper describes a proof-of-the-principle experiment in which maximum entropy learning is used for the automatic induction of shallow morphological features for the resource-scarce Bantu language of Gĩkũyũ. This novel approach circumvents the limitations of typical unsupervised morphological induction methods that employ minimum-edit distance metrics to establish morphological similarity bet...